A latent variable model to make predictions

A forecast task

Let’s take the example of a time series of a given stock market index. We want to predict whether this index will rise or fall in the next time step.

##         date      measured target
## 1 2004-05-24 -0.2102793333     -1
## 2 2004-05-25  0.5418432556      1
## 3 2004-05-26 -0.0009872216     -1
## 4 2004-05-27 -0.7341004753     -1
## 5 2004-05-28  0.0795940608      1
## 6 2004-05-31 -0.1462255312     -1

Modelization

Consider the nokki dataset above. We want to predict the binary (1/-1) target variable while observing the measured variable. The target variable is considered a hidden Bernoulli-distributed variable while measured is the variable we do observe. We will denote the hidden variable with \(z\) and the observed variable with \(x\).

We construct a Hidden Markov Model (HMM). That model considers a discrete time which will be referred to by the index \(i\in\mathbb{N}\), each line of the dataset corresponding to a discrete time step, so that the index \(i\) as well counts the lines of the dataset, beginning with 1 for the first line.

The binary variables \(z_i\in\{1,\ldots,K\}\) with \(K=2\) are identically distributed \(\forall i\) and together form a discrete-state Markov chain. They are, as already said, hidden.

The variables \(x_i\) are also identically distributed and refer to the observations of the variable measured. They are considered normal distributed and relate to the \(z_i\) by a so called observation model \(p(\underline{x}_i \,|\, z_i)\).

The global relationship between the hidden variables \(z_i\) and the observed variables \(x_i\) is given by the following joint distribution:

\[ \begin{aligned} p(\underline{z}_{1:T} \,, \underline{x}_{1:T}) &= p(\underline{z}_{1:T})\cdot p(\underline{x}_{1:T} \,|\, \underline{z}_{1:T})\\ &= \left[ p(z_1) \cdot \Pi_{t=2}^T p(z_t \,|\, z_{t-1}) \right] \cdot \left[ \Pi_{t=1}^T p(\underline{x}_t \,|\, z_t) \right]\,, \end{aligned} \]

Implementation of the forward algorithm

Given our HMM is fitted, we can make predictions about the hidden variable. How does that work?

We give ourselves a Markov chain with known transition matrix. We describe how to recursively compute the filtered marginals \(p(z_t \,|\, \underline{x}_{1:t})\) in an HMM.

The algorithm has two steps.

The prediction step:

\[ p(z_t=j \,|\, \underline{x}_{1:\,t-1}) = \sum_i p(z_t=j \,|\, z_{t-1}=i) \cdot p( z_{t-1}=i \,|\, \underline{x}_{1:\,t-1}) \] The update step:

\[ \begin{aligned} \alpha_t(j) &:= p(z_t=j \,|\, \underline{x}_{1:\,t}) \\ &= p(z_t=j \,|\, \underline{x}_t, \, \underline{x}_{1:\,t-1}) \\ &= \frac{1}{p(\underline{x}_t \,|\, \underline{x}_{1:\,t-1})} \cdot p(\underline{x}_t \,|\, z_t = j, \, \underline{x}_{1:\,t-1} ) \cdot p(z_t = j \,|\, \underline{x}_{1:\,t-1}) \\ &= \frac{1}{\sum_j p(z_t = j \,|\, \underline{x}_{1:\,t-1}) \cdot p(\underline{x}_t \,|\, z_t = j)} \cdot p(\underline{x}_t \,|\, z_t = j) \cdot p(z_t = j \,|\, \underline{x}_{1:\,t-1}) \end{aligned} \]

At the third line: The relation

\[ p(\underline{x}_t \,|\, z_t = j, \, \underline{x}_{1:\,t-1} ) = p(\underline{x}_t \,|\, z_t = j) \]

is a consequence of the \(d\)-separation property of DAGs: the node \(z_t\) is considered observed insofar as the probability of \(x_t\) is conditional on \(z_t\). Thus the node \(z_t\) separates the DAG in two independent parts.

The denominator of the fraction is seen to be a normalization constant, which we will denote by \(Z_t\).

This process is known as the predict-upgrade cycle. The distribution \(p(z_t=j \,|\, \underline{x}_{1:\,t})\) is called the (filtered) belief state at time t, and is a vector of \(K\) numbers, often denoted by \(\underline{\alpha}_t\).

The proposed algorithm:

How to fit the HMM

Say we have observed the \(x\)-column up to line \(N\) of our dataset, but not the hidden variable, \(z\), then, having observed \(x_{1:N}:=\{x_1,\ldots,x_N\}\), we can determine the parameters of an HMM using maximum likelihood.

The likelihood function is obtained from the joint distribution by marginalizing over the hidden variables \(z_i\):

\[ p(x_{1:N}\,|\,\theta)=\sum_{z_{1:N}} p(x_{1:N},\, z_{1:N}\,|\,\theta) \]

The joint distribution \(p(x_{1:N},\, z_{1:N}\,|\,\theta)\) does not factorize over the summation index.

We see that while considering the following scheme. It shows that if the \(z_i\) where observed, then by \(d\)-separation, the parameter vector for the \(x_i\) and the one for the \(z_i\) would be independent given \(D\), the set of all observed nodes: \(\underline{\theta}_x\perp\underline{\theta}_z\,|\,D\). When the \(z_i\) are not observed, the independence does not hold.

The EM algorithm starts with some initial selection for the model parameters, which we denote by \(\theta^\text{ old}\). In the \(E\) step, we take these parameter values and find the posterior distribution of the hidden variables \(p(z_{1:N}\,|\,x_{1:N},\theta^\text{ old})\). We then use this posterior distribution in order to evaluate the expectation of the logarithm of the complete likelihood function, as a function of the parameters \(\theta\), to give the function \(Q(\theta,\theta^\text{ old})\) defined by:

\[ Q(\theta,\theta^\text{ old}) = \sum_{z_{1:N}} p(z_{1:N}\,|\,x_{1:N},\theta^\text{ old})\cdot p(x_{1:N},\,z_{1:N}\,|\,\theta)\,. \]

Notations

At this point, let us introduce some useful notations. We shall use \(\gamma(z_n)\) to denote the marginal posterior distribution of the hidden variable \(z_n\), and \(\xi(z_{n-1},z_n)\) to denote the joint posterior distribution of two successive hidden variables:

\[ \begin{aligned} \gamma(z_n) &:= p(z_n\,|\,x_{1:N},\theta^\text{ old}) \\ \xi(z_{n-1},z_n) &:= p(z_{n-1},\,z_n\,|\,x_{1:N},\theta^\text{ old})\,. \end{aligned} \]

Because the hidden variables can be represented as \(K\)-dimensional (in our case, \(K=2\)) binary variables using dummy coding, we define \(z_{nk}\) as the \(k\)-th component of such a \(K\)-binary vector. Thus \(z_{nk}\) takes the value 1 if and only if the variable \(z_n\) is equal to its \(k\)-th possible value among \(K\). When we identify the value set of \(z_n\) with \(\{1,\ldots,K\}\), when can write \(p(z_{nk}=1)=p(z_n=k)\) and \(p(z_{nk}=0)=p(z_n\neq k)\).

With that notation, we can write the transition probabilities \(A_{jk}\) defining the transition matrix \(A\) of the Markov chain: \(A_{jk}=p(z_{nk}=1\,|\,z_{n-1,\,j}=1)\). The conditional distributions are then

\[ p(z_n\,|\,z_{n-1},A) = \prod_{k=1}^K \prod_{j=1}^K A_{jk}^{z_{n-1,\,j}\,\cdot\, z_{nk}}\,, \]

the initial parent node being special in that it doesn’t have a parent node:

\[ p(z_i\,|\,\pi) = \prod_{k=1}^K \pi_k^{z_{1k}}\,, \]

where \(\sum_k\pi_k=1\).

With those notations, we can write the joint probability distribution more precisely:

\[ p(x_{1:N},\,z_{1:N}\,|\,\theta) = p(z_1|\underline{\pi})\cdot\left[\prod_{n=2}^N p(z_n\,|\,z_{n-1},A)\right]\cdot \prod_{m=1}^N p(x_m\,|\,z_m,\phi)\, \]

where \(\theta=\{\underline{\pi},A,\phi\}\), \(\phi\) being the parameter of the observation model \(p(x_m\,|\,z_m)\), for example \(\phi=\{\mu,\sigma\}\) for a gaussian observation model, as is the case for us.

The EM algorithm: the E-step

We have obtained above the following expression of the expected \(\log\)-likelihood:

\[ Q(\theta,\theta^\text{ old}) = \sum_{z_{1:N}} p(z_{1:N}\,|\,x_{1:N},\theta^\text{ old})\cdot p(x_{1:N},\,z_{1:N}\,|\,\theta)\,. \]

If we substitute the joint distribution \(p(x_{1:N},\,z_{1:N}\,|\,\theta)\) by the more precise expression obtained just above, and make use of the definitions of \(\gamma\) and \(\xi\), we obtain:

\[ Q(\theta,\theta^\text{ old}) = \sum_{k=1}^K \gamma(z_{ik})\cdot\log(\pi_k) + \sum_{n=2}^N \sum_{j=1}^K \sum_{k=1}^K \xi(z_{n-1,\,j},\,z_{nk})\cdot\log(A_{jk}) + \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk})\cdot\log(p(x_n\,|\,\phi_k))\,. \]

The goal of the E-step is to evaluate the quantities \(\gamma(z_n)\) and \(\xi(z_{n-1},\,z_n)\) efficiently. For that purpose, we well use the alpha-beta algorithm, a common variant of the backward-forward algorithm family.

The EM algorithm: the M-step

We have to maximize \(Q(\theta,\theta^\text{ old})\) with respect to the parameters \(\theta=\{\underline{\pi},A,\phi\}\). For doing that we treat \(\gamma(z_n)\) and \(\xi(z_{n-1},\,z_n)\) as constants.

Maximization yields:

\[ \begin{aligned} \pi_k &= \frac{\gamma(z_{1k})}{\sum_{j=1}^K \gamma(z_{ij})}\quad, \\ A_{jk} &= \frac{\sum_{n=2}^N \xi(z_{n-1,\,j},\,z_{nk})}{\sum_{q=1}^K \sum_{n=2}^N \xi(z_{n-1,\,j},\,z_{nq})}\quad. \end{aligned} \]

The EM algorithm must be initialized by choosing starting values for \(\underline{\pi}\) and \(A\) which respect the summation constraints associated with their probabilistic interpretation.

To maximize \(Q(\theta,\theta^\text{ old})\) with respect to \(\phi_k=\{\mu_k,\sigma_k\}\) we can proceed independently, insofar as the \(\phi_k\) are independent for the different components.

Maximization yields:

\[ \begin{aligned} \mu_k &= \frac{\sum_{n=1}^N \gamma(z_{nk})\cdot x_n}{\sum_{n=1}^N \gamma(z_{nk})} \\ \Sigma_k &= \frac{\sum_{n=1}^N \gamma(z_{nk})\cdot(x_n-\mu_k)\cdot(x_n-\mu_k)^T}{\sum_{n=1}^N \gamma(z_{nk})}\quad. \end{aligned} \]

The alpha-beta algorithm

So far, it remains us to compute the \(\gamma\) and \(\xi\) values in order to proceed with the EM algorithm, which in turn will fit our Hidden Markov model (HMM) for the Nokki dataset. With our fitted model, we will then obtain predictions using the online filtering forward algorithm.

We shall omit the dependence on the parameter vector \(\theta^\text{ old}\) because it remains fixed throughout.

Making use of the graph properties specific to the chosen model

By applying the \(d\)-separation on the HMM graph, we get the following conditional independence properties:

\(p(x_{1:N}\,|\,z_n) = p(x_{1:n}\,|\,z_n)\cdot p(x_{n+1:N}\,|\,z_n)\)

Indeed, every path from anyone of the nodes \(x_1,\ldots,x_{n-1}\) to the node \(x_n\) passed through the node \(z_n\), which is observed.
Similar explanations yield:

\(p(x_{1:n-1}\,|\,x_n,z_n) = p(x_{1:n-1}\,|\,z_n)\)
\(p(x_{1:n-1}\,|\,z_{n-1},z_n) = p(x_{1:n-1}\,|\,z_{n-1})\)
\(p(x_{1:N}\,|\,z_{n-1},z_n) = p(x_{1:n-1}\,|\,z_{n-1})\cdot p(x_n|z_n)\cdot p(x_{n+1:N}\,|\,z_n)\)

Evaluation of \(\gamma(z_{n})\)

We recall that for a discrete multinomial variable like \(z_n\), which has been dummy recoded as a vector variable \(z_n=(z_{n1},\ldots,z_{nK})^T\), the expected value of one of its components \(z_{nk}\) is just the probability of that component having the value 1: \(E[z_{nk}]=p(z_{nk}=1)\). We thus interested in the posterior distribution \(p(z_n\,|\,x_{1:N})\) of \(z_n\) given the whole observed dataset. This represents a vector of length \(K\) whose entries are the expected values of \(z_{nk}\), \(k=1,\ldots,K\). We thus have:

\[ \gamma(z_n) = p(z_n\,|\,x_{1:N}) = \frac{p(x_{1:N}\,|\,z_n)\cdot p(z_n)}{p(x_{1:N})}. \]

Note that the denominator is implicitely conditioned on \(\theta^\text{ old}\) and hence is the likelihood function of the HMM.

Using the first of the fist of the above given independence properties, we get:

\[ \begin{aligned} \gamma(z_n) &= \frac{p(x_{1:n}\,|\,z_n)\cdot p(x_{n+1:N}\,|\,z_n)}{p(x_{1:N)}} \\ &= \frac{\alpha(z_n)\cdot \beta(z_n)}{p(x_{1:N)}}\quad, \end{aligned} \]

where we define

\(\alpha(z_n):=p(x_{1:n}\,,z_n)\),
\(\beta(z_n):=p(x_{n+1:N}\,|\,z_n)\).

\(\alpha(z_n)\) represents the joint probability of observing all of the given data up to time \(n\) as well as the value of \(z_n\), whereas \(\beta(z_n)\) represents the conditional probability of all future data from time \(n+1\) to \(N\) given the value of \(z_n\). Each \(\alpha(z_n)\) and \(\beta(z_n)\) are sets of \(K\) numbers, one for each of the possible settings of the dummy coded binary \(K\)-dimensional vector \(z_n\).

We now come to recursion relations about both \(\alpha(z_n)\) and \(\beta(z_n)\).

\[ \begin{aligned} \alpha(z_n) &= p(x_{1:n}\,,z_n) \\ &= p(x_{1:n}\,|\,z_n)\cdot p(z_n) \\ &= p(x_n|z_n)\cdot p(x_{1:{n-1}}\,|\,z_n)\cdot p(z_n) \\ &= p(x_n|z_n)\cdot p(x_{1:{n-1}}\,,z_n) \\ &= p(x_n|z_n)\cdot \sum_{z_{n-1}}p(x_{1:{n-1}}\,,z_{n-1},z_n) \\ &= p(x_n|z_n)\cdot \sum_{z_{n-1}}p(x_{1:{n-1}}\,,z_n\,|\,z_{n-1})\cdot p(z_{n-1}) \\ &= p(x_n|z_n)\cdot \sum_{z_{n-1}}p(x_{1:{n-1}}\,|\,z_{n-1})\cdot p(z_n\,|\,z_{n-1})\cdot p(z_{n-1}) \\ &= p(x_n|z_n)\cdot \sum_{z_{n-1}}p(x_{1:{n-1}}\,,z_{n-1})\cdot p(z_n\,|\,z_{n-1}) \\ \end{aligned} \]

Thus

\[ \alpha(z_n) = p(x_n|z_n)\cdot \sum_{z_{n-1}}\alpha(z_{n-1})\cdot p(z_n\,|\,z_{n-1})\,. \]

The summation \(\sum_{z_{n-1}}\ldots\) contains \(K\) terms, one for each possible value that the variable \(z_n\) can take. The expression \(\alpha(z_n)\) has to be evaluated for each of the \(K\) possible values that \(z_n\) can take. Thus each step of the \(\alpha\) recurson has computational cost \(\mathcal{O}(K^2)\).

In order to start this recursion, we need an initial condition which is given by

\[ \begin{aligned} \alpha(z_1) &= p(x_1,z_1) \\ &= p(z_1)\cdot p(x_1|z_1) \\ &= \prod_{k=1}^K \big(\pi_k\cdot p(x_1\,|\,\phi_k)\big)^{z_{1k}}\,, \end{aligned} \]

which tells us that \(\alpha(z_{1k})\) must be set equal to \(\pi_k\cdot p(x_i\,|\,\phi_k)\). Starting at the first node of the chain, we can evaluate \(\alpha(z_n)\) for every hidden node \(z_n\).

Initialization

Our hidden variable \(z\) is the Nokki variable target, which has only two possible values, therefore \(K=2\). Our model has parameters \(\theta=\{\underline{\pi},A,\phi\}\) that we initialize arbitrarily with \(\pi=(0.5,0.5)^T\) (observing the sum-to-one constraint), \(A=\begin{bmatrix}1 & 0 \\0.5 & 0.5\end{bmatrix}\) (observing the sum-to-one constraint for each row), and \(\phi=\{\mu=(0,0)^T, \sigma=(1,1)^T\}\), where \(\mu_k\) is the \(k\) component of the vector \(\mu\), and analogous with the vector \(\sigma\).

The parameters \(\underline{\pi}\) and \(A\) define the hidden Markov chain itself, whereas \(\phi\) defines the observation model, which is gaussian.

# Initialization
N <- 8 # For the sake of simplicity, we consider only the first nodes of the Markov chain.
K<-2
pii <- c(0.5,0.5)
A <- matrix(c(1, 0, 0.5, 0.5), nrow = 2, ncol = 2)
mu <- c(0,0)
sigma <- c(1,1)
# The first observed variable, x_1:
nokki$measured[1]

## [1] -0.2102793

The alpha recursion

For the node \(n\), the quantity \(\alpha(z_n)\) will be represented by a \(K\)-dimensional vector, and the whole \(\alpha\) recursion over the nodes will produce a \((N,K)\)-dimensional dataset, the \(n\)-th line associated with the \(n\)-th hidden node \(z_n\) of the Markov chain.

# Initialization of the alpha recursion. The first step
alpha <- data.frame(matrix(0, nrow = N, ncol = K))
colnames(alpha) <- c("k=1", "k=2")
alpha[1,1] <- pii[1]*dnorm(nokki$measured[1], mu[1], sigma[1])
alpha[1,2] <- pii[2]*dnorm(nokki$measured[1], mu[2], sigma[2])

For subsequent steps from the 2nd, we implement

\[ \alpha(z_n) = p(x_n|z_n)\cdot \sum_{z_{n-1}}\alpha(z_{n-1})\cdot p(z_n\,|\,z_{n-1})\,. \]

where

\[ \begin{aligned} p(x_n|z_n) &= p(x_n|z_n,\phi) \\ &= \prod_{k=1}^K p(x_n|\phi_k)^{z_{nk}} \\ &= \prod_{k=1}^K \mathcal{N}(x_n|\mu_k,\sigma_k)^{z_{nk}} \\ \end{aligned} \]

# From the 2nd step to all next steps
for(n in 2:N){
  for(k in 1:K){
    j <- 1:K
    alpha[n,k] <- dnorm(nokki$measured[n],mu[k],sigma[k]) * sum(alpha[n-1,j]*A[j,k])
  }
}
alpha

##            k=1          k=2
## 1 0.1951094857 0.1951094857
## 2 0.0672102171 0.0672102171
## 3 0.0268129842 0.0268129842
## 4 0.0081702211 0.0081702211
## 5 0.0032491383 0.0032491383
## 6 0.0012824347 0.0012824347
## 7 0.0004583327 0.0004583327
## 8 0.0001778953 0.0001778953

Similarly, there exists a recursion for the \(\beta(z_n)\). By making use of the conditional independence properties, we get:

\[ \begin{aligned} \beta(z_n) &= p(x_{n+1:N}\,|\,z_n) \\ &= \sum_{z_{n+1}} p(x_{n+1:N}\,,z_{n+1}\,|\,z_n) \\ &= \sum_{z_{n+1}} p(x_{n+1:N}\,|\,z_n,z_{n+1})\cdot p(z_{n+1}|z_n) \\ &= \sum_{z_{n+1}} p(x_{n+1:N}\,|\,z_{n+1})\cdot p(z_{n+1},z_n) \\ &= \sum_{z_{n+1}} p(x_{n+2:N}\,|\,z_{n+1})\cdot p(x_{n+1}|z_{n+1})\cdot p(z_{n+1}|z_n)\,. \\ \end{aligned} \]

Thus

\[ \beta(z_n)=\sum_{z_{n+1}} \beta(z_{n+1})\cdot p(x_{n+1}|z_{n+1})\cdot p(z_{n+1}|z_n)\,. \]

Again we need a starting condition for the recursion, namely a value for \(\beta(z_N)\). To find it, we consider the relationship

\[ \gamma(z_n) = \frac{\alpha(z_n)\cdot \beta(z_n)}{p(x_{1:N)}} \]

by setting \(n=N\). It yields

\[ \gamma(z_N) = \frac{\alpha(z_N)}{p(x_{1:N)}}\cdot \beta(z_N) = \gamma(z_N)\cdot \beta(z_N)\,, \]

a relation that holds only for \(\beta(z_N) = 1\).

The first step of the beta backward recursion

# Initialization of the beta recursion. The first step
beta <- data.frame(matrix(0, nrow = N, ncol = K))
colnames(beta) <- c("k=1", "k=2")
beta[N,1] <- 1
beta[N,2] <- 1

The beta recursion

# From the 2nd step to all next steps
for(n in 1:(N-1)){
  for(k in 1:K){
    j <- 1:K
    beta[N-n,k] <- sum(beta[N-n+1,k] * dnorm(nokki$measured[N-n+1], mu[j], sigma[j]) * A[j,k])
  }
}
beta

##            k=1          k=2
## 1 0.0009117715 0.0009117715
## 2 0.0026468486 0.0026468486
## 3 0.0066346687 0.0066346687
## 4 0.0217736172 0.0217736172
## 5 0.0547515214 0.0547515214
## 6 0.1387168275 0.1387168275
## 7 0.3881356717 0.3881356717
## 8 1.0000000000 1.0000000000

Evaluation of the likelihood \(p(x_{1:N})\) as well as \(\gamma(z_n)\) and \(\xi(z_{n-1},z_n)\)

As mentioned, \(p(x_{1:N})\) is implicitly conditioned on \(\theta^\text{ old}\) and is thus really a likelihood, that of our HMM. We implement the below explained formula for the likelihood:

\[ p(x_{1:N})=\sum_{z_N}\alpha(z_N)\,. \]

# Evaluation of the likelihood
j <- 1:K
likelihood <- sum(alpha[N,j])

For the evaluation of \(\gamma\), we define a \((N,K)\)-dimensional dataset and implement the formula:

\[ \gamma(z_n) = \frac{\alpha(z_n)\cdot \beta(z_n)}{p(x_{1:N)}} \]

# Evaluation of gamma
gamma <- data.frame(matrix(0, nrow = N, ncol = K))
colnames(gamma) <- c("k=1", "k=2")
n <- 1:N
k <- 1:K
gamma[n,k] <- alpha[n,k]*beta[n,k]/likelihood
gamma

##   k=1 k=2
## 1 0.5 0.5
## 2 0.5 0.5
## 3 0.5 0.5
## 4 0.5 0.5
## 5 0.5 0.5
## 6 0.5 0.5
## 7 0.5 0.5
## 8 0.5 0.5

Note

In the M-step, the quantity \(p(x_{1:N})\) will cancel out in the equation for \(\mu_k\):

\[ \begin{aligned} \mu_k &= \frac{\sum_{n=1}^N \gamma(z_{nk})\cdot x_n}{\sum_{n=1}^N \gamma(z_{nk})} \\ &=\frac{\sum_{n=1}^N \alpha(z_{nk})\cdot \beta(z_{nk})\cdot x_n}{\sum_{n=1}^N \alpha(z_{nk})\cdot \beta(z_{nk})}\quad. \end{aligned} \]

# Calculation of K-dimensional vector mu
k <- 1
n <- 1:N
for(k in 1:K){
  numerator <- sum(alpha[n,k]*beta[n,k]*nokki$measured[n])
  denominator <- sum(alpha[n,k]*beta[n,k])
  mu[k] <- numerator/denominator
}
mu

## [1] -0.08810001 -0.08810001

However as mentioned above, \(p(x_{1:N})\) represents the likelihood function whose value we need in order to stop the EM algorithm, that knowingly stops when change of the likelihood from step to step becomes less than a given threshold. That’s why we need to evaluate the likelihood :

\[ p(x_{1:N})=\sum_{z_n}\alpha(z_n)\cdot\beta(z_n)=\sum_{z_N}\alpha(z_N)\,. \]

Evaluation of \(\xi(z_{n-1},\,z_{n})\)

\[ \begin{aligned} \xi(z_{n-1},z_n) &= p(z_{n-1},z_n\,|\,x_{1:N}) \\ &= \frac{1}{p(x_{1:N})}\cdot p(x_{1:n-1}\,|\,z_{n-1})\cdot p(x_n|z_n)\cdot p(x_{n+1:N}\,|\,z_n)\cdot p(z_n|z_{n-1})\cdot p(z_{n-1}) \\ &= \frac{1}{p(x_{1:N})}\cdot \Big[\alpha(z_{n-1})\cdot p(x_n|z_n)\cdot p(z_n|z_{n-1}))\cdot \beta(z_n)\Big] \end{aligned} \]

Thus we can calculate the \(\xi(z_{n-1},\,z_{n})\) directly by using the results of \(\alpha\) and \(\beta\) recursions.

For the evaluation of \(\xi\),

# Evaluation of xi
## Definition of the shape of the dataset
xi <- data.frame(matrix(0, nrow = K*K*N, ncol = 4))
colnames(xi) <- c("n","z_{n-1}", "z_n","xi")
for(n in 1:N){
  for(j in 1:K){
    for(k in 1:K){
      xi[K*K*(n-1)+K*(j-1)+k,1] <- n
      xi[K*K*(n-1)+K*(j-1)+k,2] <- j
      xi[K*K*(n-1)+K*(j-1)+k,3] <- k
    }
  }
}
## Filling with values for the variable xi
for(n in 1:N){
  for(j in 1:K){
    for(k in 1:K){
      xi[K*K*(n-1)+K*(j-1)+k,4] <- (alpha[n,k]*dnorm(nokki$measured[n],mu[k],sigma[k])
                    *A[j,k]*beta[n,k])/likelihood
    }
  }
}
xi

##    n z_{n-1} z_n         xi
## 1  1       1   1 0.19798785
## 2  1       1   2 0.09899393
## 3  1       2   1 0.00000000
## 4  1       2   2 0.09899393
## 5  2       1   1 0.16357233
## 6  2       1   2 0.08178617
## 7  2       2   1 0.00000000
## 8  2       2   2 0.08178617
## 9  3       1   1 0.19871572
## 10 3       1   2 0.09935786
## 11 3       2   1 0.00000000
## 12 3       2   2 0.09935786
## 13 4       1   1 0.16190525
## 14 4       1   2 0.08095262
## 15 4       2   1 0.00000000
## 16 4       2   2 0.08095262
## 17 5       1   1 0.19668607
## 18 5       1   2 0.09834304
## 19 5       2   1 0.00000000
## 20 5       2   2 0.09834304
## 21 6       1   1 0.19913446
## 22 6       1   2 0.09956723
## 23 6       2   1 0.00000000
## 24 6       2   2 0.09956723
## 25 7       1   1 0.18551322
## 26 7       1   2 0.09275661
## 27 7       2   1 0.00000000
## 28 7       2   2 0.09275661
## 29 8       1   1 0.18936569
## 30 8       1   2 0.09468285
## 31 8       2   1 0.00000000
## 32 8       2   2 0.09468285

Summary of the required steps for fitting our HMM

To fit our HMM using the EM algorithm, we first initialize the parameters \(\theta:=(\underline{\pi},A,\underline{\phi})\) by setting them equal to an arbitrary choice \(\theta^\text{ old}\). The \(\underline{\pi}\) and \(A\) parameters can be initialized uniformly or randomly from a uniform distribution, respecting their non-negativity and sum-to-one constraints.

Initialization of the parameters of our gaussian observation model \(\underline{\phi}=\{\underline{\mu},\underline{\sigma}\}=\{(\mu_1,\ldots,\mu_K)^T,\, (\sigma_1,\ldots,\sigma_K)^T\}\) can be done by applying the \(K\)-means algorithm to the data, and \(\sigma_k\) can be initialized to be the variance of the corresponding cluster.

Then we run both the forward \(\alpha\) recursion and the backward \(\beta\) recursion and use the results to evaluate \(\gamma(z_n)\) and \(\xi(z_{n-1},\,z_{n})\). At this stage, we evaluate the likelihood function as well. This completes the E-step. We use the results to find a revised set of parameters \(\theta^\text{ new}\) using the M-step equations. We then continue to alternate between E- and M-steps until the change in the likelihood function is below some threshold.

The running EM algorithm

Now that we have explained and coded each substep of the EM algorithm for HMM, it remains to put it all together.

# Initialization
N <- 8 # For the sake of simplicity, we consider only the first nodes of the Markov chain.
K <- 2
pii <- vector("numeric",K)
pii <- c(0.5,0.5)
A <- matrix(vector("numeric",K), nrow = K, ncol = K)
A <- matrix(c(0.25, 0.35, 0.75, 0.65), nrow = 2, ncol = 2)
mu <- vector("numeric",K)
mu <- c(0,0)
sigma <- vector("numeric",K)
sigma <- c(1,1)

for(i in 1:1){  #Inserting the above code into a loop is a trick
                #that helps for getting it printed in only one block
  cat("pii\n")
  print(pii)
  cat("Markov transition matrix \n")
  print(A)
  cat("mu\n")
  print(mu)
  cat("sigma\n")
  print(sigma)
  cat("\n")
}

## pii
## [1] 0.5 0.5
## Markov transition matrix 
##      [,1] [,2]
## [1,] 0.25 0.75
## [2,] 0.35 0.65
## mu
## [1] 0 0
## sigma
## [1] 1 1

library(dplyr)
# tables for monitoring parameter convergence
store_pii <- data.frame(t(pii))
store_mu <- data.frame(t(mu))
store_sigma <- data.frame(t(sigma))
store_A <- data.frame(t(c(A)))
store_likelihood <- data.frame(likelihood)

i <- 1
numberOfLoops <- 20
while(i < numberOfLoops){
  # E-step
  ## Alpha
  ### Initialization
  alpha <- data.frame(matrix(0, nrow = N, ncol = K))
  colnames(alpha) <- c("k=1", "k=2")
  alpha[1,1] <- pii[1]*dnorm(nokki$measured[1], mu[1], sigma[1])
  alpha[1,2] <- pii[2]*dnorm(nokki$measured[1], mu[2], sigma[2])
  ### Forward recursion
  for(n in 2:N){
    for(k in 1:K){
      j <- 1:K
      alpha[n,k] <- dnorm(nokki$measured[n],mu[k],sigma[k]) * sum(alpha[n-1,j]*A[j,k])
    }
  }
  ## Beta
  ### Initialization
  beta <- data.frame(matrix(0, nrow = N, ncol = K))
  colnames(beta) <- c("k=1", "k=2")
  beta[N,1] <- 1
  beta[N,2] <- 1
  ### Backward recursion
  for(n in 1:(N-1)){
    for(k in 1:K){
      j <- 1:K
    beta[N-n,k] <- sum(beta[N-n+1,k] * dnorm(nokki$measured[N-n+1], mu[j], sigma[j]) * A[j,k])
    }
  }
  ## Likelihood
  j <- 1:K
  likelihood <- sum(alpha[N,j])
  ## Gamma
  gamma <- data.frame(matrix(0, nrow = N, ncol = K))
  colnames(gamma) <- c("k=1", "k=2")
  n <- 1:N
  k <- 1:K
  gamma[n,k] <- alpha[n,k]*beta[n,k]/likelihood
  ## Xi
  ### Definition of the shape of xi
  xi <- data.frame(matrix(0, nrow = K*K*N, ncol = 4))
  colnames(xi) <- c("n_value","z_n_moinsUn", "z_n","xi")
  for(n in 1:N){
    for(j in 1:K){
      for(k in 1:K){
        xi[K*K*(n-1)+K*(j-1)+k,1] <- n
        xi[K*K*(n-1)+K*(j-1)+k,2] <- j
        xi[K*K*(n-1)+K*(j-1)+k,3] <- k
      }
    }
  }
  xi <- filter(xi,n_value>1)
  ### Filling dataset xi with xi-values
  for(n in 2:N){
    for(j in 1:K){
      for(k in 1:K){
        xi[K*K*(n-2)+K*(j-1)+k,4] <- (alpha[n-1,k]*dnorm(nokki$measured[n],mu[k],sigma[k])
                      *A[j,k]*beta[n,k])/likelihood
      }
    }
  }
  
  # M-step
  ## Pii
  kk <- 1:K
  for(k in 1:K){
    numerator <- gamma[1,k]
    denominator <- sum(gamma[1,kk])
    pii[k] <- numerator/denominator
  }
  ## A
  for(k in 1:K){
    for(j in 1:K){
      numerator <- 0
      for(n in 2:N){
        numerator <- numerator + filter(xi,n_value==n & z_n_moinsUn==j & z_n==k)[[4]]
      }
      denominator <- 0
      for(jj in 1:K){
        for(n in 2:N){
          denominator <- denominator + filter(xi,n_value==n & z_n_moinsUn==j & z_n==jj)[[4]]
        }
      }
      A[k,j] <- numerator/denominator
    }
  }
  ## Mu
  n <- 1:N
  for(k in 1:K){
  numerator <- sum(alpha[n,k]*beta[n,k]*nokki$measured[n])
  denominator <- sum(alpha[n,k]*beta[n,k])
  mu[k] <- numerator/denominator
  }
  ## Sigma
  for(k in 1:K){
    numerator <- 0
    for(n in 1:N){
      numerator <- numerator + 
        gamma[n,k]*(nokki$measured[n]-mu[k])%*%t(nokki$measured[n]-mu[k])
    }
    denominator <- sum(gamma[n,k])
    sigma[k] <- numerator/denominator
  }
  
  if(abs((i/numberOfLoops<0.2))||(abs(numberOfLoops-i)/numberOfLoops<0.2)){
    cat("\n Step n = ",i,"\n")
    cat("===========\n")
    cat("pii\n")
    print(pii)
    cat("Markov transition matrix \n")
    print(A)
    cat("mu\n")
    print(mu)
    cat("sigma\n")
    print(sigma)
  }
 
  store_pii <- rbind(store_pii, data.frame(t(pii)))
  store_mu <- rbind(store_mu, data.frame(t(mu)))
  store_sigma <- rbind(store_sigma, data.frame(t(sigma)))
  store_A <- rbind(store_A, data.frame(t(c(A))))
  store_likelihood <- rbind(store_likelihood,data.frame(likelihood))
  
  i <- i+1
}

## 
##  Step n =  1 
## ===========
## pii
## [1] 0.002648566 0.997351434
## Markov transition matrix 
##            [,1]       [,2]
## [1,] 0.01706358 0.02727781
## [2,] 0.98293642 0.97272219
## mu
## [1] -0.06597361 -0.02371714
## sigma
## [1] 0.2698719 5.0864943
## 
##  Step n =  2 
## ===========
## pii
## [1] 0.02564032 0.97435968
## Markov transition matrix 
##           [,1]       [,2]
## [1,] 0.9694939 0.98089322
## [2,] 0.0305061 0.01910678
## mu
## [1] -0.0225262 -0.1016069
## sigma
## [1] 0.2924687 1.3919037
## 
##  Step n =  3 
## ===========
## pii
## [1] 0.09203595 0.90796405
## Markov transition matrix 
##           [,1]      [,2]
## [1,] 0.3668192 0.4834276
## [2,] 0.6331808 0.5165724
## mu
## [1] -0.0779264 -0.2744360
## sigma
## [1] 1.061383 5.450796
## 
##  Step n =  17 
## ===========
## pii
## [1] 0.2140877 0.7859123
## Markov transition matrix 
##           [,1]      [,2]
## [1,] 0.3830684 0.5007572
## [2,] 0.6169316 0.4992428
## mu
## [1] -0.07025889 -0.10029462
## sigma
## [1] 1.149351 1.089568
## 
##  Step n =  18 
## ===========
## pii
## [1] 0.2111275 0.7888725
## Markov transition matrix 
##          [,1]      [,2]
## [1,] 0.382485 0.5001398
## [2,] 0.617515 0.4998602
## mu
## [1] -0.07006379 -0.10037659
## sigma
## [1] 1.149705 1.089326
## 
##  Step n =  19 
## ===========
## pii
## [1] 0.2081668 0.7918332
## Markov transition matrix 
##           [,1]      [,2]
## [1,] 0.3819412 0.4995641
## [2,] 0.6180588 0.5004359
## mu
## [1] -0.06986963 -0.10046134
## sigma
## [1] 1.150014 1.089107

Consider the convergence of the likelihood after only a few steps:

##      likelihood
## 1  3.557905e-04
## 2  3.557905e-04
## 3  4.865013e-08
## 4  6.044383e-04
## 5  8.356111e-07
## 6  1.029319e-04
## 7  8.724705e-05
## 8  1.416213e-04
## 9  1.682182e-04
## 10 1.753885e-04
## 11 1.782563e-04
## 12 1.797233e-04
## 13 1.805805e-04
## 14 1.811489e-04
## 15 1.815742e-04
## 16 1.819260e-04
## 17 1.822392e-04
## 18 1.825318e-04
## 19 1.828134e-04
## 20 1.830895e-04

We finally compute the predictive distribution: the observed data is \(\{x_1,\ldots,x_N\}\) and we wish to predict \(x_{N+1}\), which is important for real-time applications like, in our case, forecasting.

\[ \begin{aligned} p(x_{N+1}\,|\,x_{1:N}) &= \sum_{z_{N+1}} p(x_{N+1},z_{N+1}\,|\,x_{1:N}) \\ &= \sum_{z_{N+1}} p(x_{N+1}\,|\,z_{N+1},\,x_{1:N})\cdot p(z_{N+1}\,|\,x_{1:N}) \\ &= \sum_{z_{N+1}}\left[ p(x_{N+1}\,|\,z_{N+1})\cdot \sum_{z_N} p(z_{N+1},\,z_N\,|\,x_{1:N}) \right] \\ &= \sum_{z_{N+1}} \left[ p(x_{N+1}\,|\,z_{N+1})\cdot \sum_{z_N} p(z_{N+1}\,|\,z_N)\cdot p(z_N\,|\,x_{1:N}) \right] \\ &= \sum_{z_{N+1}} \left[ p(x_{N+1}\,|\,z_{N+1})\cdot \sum_{z_N} p(z_{N+1}\,|\,z_N)\cdot \frac{p(z_N,\,x_{1:N})}{p(x_{1:N})} \right] \\ &= \frac{1}{p(x_{1:N})}\cdot \sum_{z_{N+1}} \left[ p(x_{N+1}\,|\,z_{N+1}) \cdot \sum_{z_N} p(z_{N+1}\,|\,z_N) \cdot \alpha(z_N) \right] \,. \end{aligned} \]

This quantity can be evaluated by first running a forward alpha recursion and then computing the summations over \(z_N\) (\(K\) terms) and \(z_{N+1}\) (\(K\) terms). The predicted \(x_{N+1}\) can then eventually be used to predict the subsequent value \(x_{N+2}\).

We therefore run an alpha recursion using the obtained parameter values for the hidden Markov chain \(\pi\) and \(A\ldots\)

pii

## [1] 0.2081668 0.7918332

##           [,1]      [,2]
## [1,] 0.3819412 0.4995641
## [2,] 0.6180588 0.5004359

\(\ldots\) as well as for the observation model \(\mu\) and \(\sigma\):

mu

## [1] -0.06986963 -0.10046134

sigma

## [1] 1.150014 1.089107

## Alpha
  ### Initialization
  alpha <- data.frame(matrix(0, nrow = N, ncol = K))
  colnames(alpha) <- c("k=1", "k=2")
  alpha[1,1] <- pii[1]*dnorm(nokki$measured[1], mu[1], sigma[1])
  alpha[1,2] <- pii[2]*dnorm(nokki$measured[1], mu[2], sigma[2])
  ### Forward recursion
  for(n in 2:N){
    for(k in 1:K){
      j <- 1:K
      alpha[n,k] <- dnorm(nokki$measured[n],mu[k],sigma[k]) * sum(alpha[n-1,j]*A[j,k])
    }
  }

So far we have made the alpha recursion.

We now want to use it to compute the predictive expression that we obtained above. To do that, we proceed in three steps.
At first we compute the scalar: \[ \frac{1}{p(x_{1:N})} \]

As explained before, the denominator is the likelihood:

1/likelihood

## [1] 5461.808

Secondly we compute the \(K\)-dimensional vector (the terms are the coefficients of a \((K,K)\) matrix, and the sum is a vector): \[ \sum_{z_N} p(z_{N+1}\,|\,z_N) \cdot \alpha(z_N) \]

k=1:K
sum1 <- 0
for(j in 1:K){
  sum1 <- sum1 + A[k,j]*alpha[N,j]
}
sum1

## [1] 8.101248e-05 1.023506e-04

Thirdly we compute the scalar: \[ \sum_{z_{N+1}} \left[ p(x_{N+1}\,|\,z_{N+1}) \cdot \sum_{z_N} p(z_{N+1}\,|\,z_N) \cdot \alpha(z_N) \right] \]

sum2 <- 0
for(k in 1:K){
  sum1 <- 0
  for(j in 1:K){
    sum1 <- sum1 + A[k,j]*alpha[N,j]
  }
  sum2 <- sum2 + dnorm(nokki$measured[N],mu[k],sigma[k]) * sum1
}
sum2

## [1] 6.28978e-05

Putting it all together, we get the prediction for the next observation \(x_{N+1}\):

sum2 <- 0
for(k in 1:K){
  sum1 <- 0
  for(j in 1:K){
    sum1 <- sum1 + A[k,j]*alpha[N,j]
  }
  sum2 <- sum2 + dnorm(nokki$measured[N],mu[k],sigma[k]) * sum1
}
prediction <- sum2/likelihood
prediction

## [1] 0.3435357

Comparing with the dataset value at corresponding line \(N+1\):

nokki$measured[N+1]

## [1] -0.08156015

In order to measure the prediction rate, one has to test the prediction method many times and do statistics.

Testing the predictive method

We run the EM algorithm many times, each time for an increasing number of observations \(N\). Each run consists of the EM algorithm followed by the prediction procedure. Each run associates a prediction to a \(N\) value.

EM_algo <- function(N){
# Initialization
K <- 2
pii <- vector("numeric",K)
pii <- c(0.5,0.5)
A <- matrix(vector("numeric",K), nrow = K, ncol = K)
A <- matrix(c(0.25, 0.35, 0.75, 0.65), nrow = 2, ncol = 2)
mu <- vector("numeric",K)
mu <- c(0,0)
sigma <- vector("numeric",K)
sigma <- c(1,1)

# EM algorithm is running
library(dplyr)
# tables for monitoring parameter convergence
store_pii <- data.frame(t(pii))
store_mu <- data.frame(t(mu))
store_sigma <- data.frame(t(sigma))
store_A <- data.frame(t(c(A)))
store_likelihood <- data.frame(likelihood)

i <- 1
numberOfLoops <- 20
while(i < numberOfLoops){
  # E-step
  ## Alpha
  ### Initialization
  alpha <- data.frame(matrix(0, nrow = N, ncol = K))
  colnames(alpha) <- c("k=1", "k=2")
  alpha[1,1] <- pii[1]*dnorm(nokki$measured[1], mu[1], sigma[1])
  alpha[1,2] <- pii[2]*dnorm(nokki$measured[1], mu[2], sigma[2])
  ### Forward recursion
  for(n in 2:N){
    for(k in 1:K){
      j <- 1:K
      alpha[n,k] <- dnorm(nokki$measured[n],mu[k],sigma[k]) * sum(alpha[n-1,j]*A[j,k])
    }
  }
  ## Beta
  ### Initialization
  beta <- data.frame(matrix(0, nrow = N, ncol = K))
  colnames(beta) <- c("k=1", "k=2")
  beta[N,1] <- 1
  beta[N,2] <- 1
  ### Backward recursion
  for(n in 1:(N-1)){
    for(k in 1:K){
      j <- 1:K
    beta[N-n,k] <- sum(beta[N-n+1,k] * dnorm(nokki$measured[N-n+1], mu[j], sigma[j]) * A[j,k])
    }
  }
  ## Likelihood
  j <- 1:K
  likelihood <- sum(alpha[N,j])
  ## Gamma
  gamma <- data.frame(matrix(0, nrow = N, ncol = K))
  colnames(gamma) <- c("k=1", "k=2")
  n <- 1:N
  k <- 1:K
  gamma[n,k] <- alpha[n,k]*beta[n,k]/likelihood
  ## Xi
  ### Definition of the shape of xi
  xi <- data.frame(matrix(0, nrow = K*K*N, ncol = 4))
  colnames(xi) <- c("n_value","z_n_moinsUn", "z_n","xi")
  for(n in 1:N){
    for(j in 1:K){
      for(k in 1:K){
        xi[K*K*(n-1)+K*(j-1)+k,1] <- n
        xi[K*K*(n-1)+K*(j-1)+k,2] <- j
        xi[K*K*(n-1)+K*(j-1)+k,3] <- k
      }
    }
  }
  xi <- filter(xi,n_value>1)
  ### Filling dataset xi with xi-values
  for(n in 2:N){
    for(j in 1:K){
      for(k in 1:K){
        xi[K*K*(n-2)+K*(j-1)+k,4] <- (alpha[n-1,k]*dnorm(nokki$measured[n],mu[k],sigma[k])
                      *A[j,k]*beta[n,k])/likelihood
      }
    }
  }
  
  # M-step
  ## Pii
  kk <- 1:K
  for(k in 1:K){
    numerator <- gamma[1,k]
    denominator <- sum(gamma[1,kk])
    pii[k] <- numerator/denominator
  }
  ## A
  for(k in 1:K){
    for(j in 1:K){
      numerator <- 0
      for(n in 2:N){
        numerator <- numerator + filter(xi,n_value==n & z_n_moinsUn==j & z_n==k)[[4]]
      }
      denominator <- 0
      for(jj in 1:K){
        for(n in 2:N){
          denominator <- denominator + filter(xi,n_value==n & z_n_moinsUn==j & z_n==jj)[[4]]
        }
      }
      A[k,j] <- numerator/denominator
    }
  }
  ## Mu
  n <- 1:N
  for(k in 1:K){
  numerator <- sum(alpha[n,k]*beta[n,k]*nokki$measured[n])
  denominator <- sum(alpha[n,k]*beta[n,k])
  mu[k] <- numerator/denominator
  }
  ## Sigma
  for(k in 1:K){
    numerator <- 0
    for(n in 1:N){
      numerator <- numerator + 
        gamma[n,k]*(nokki$measured[n]-mu[k])%*%t(nokki$measured[n]-mu[k])
    }
    denominator <- sum(gamma[n,k])
    sigma[k] <- numerator/denominator
  }
  
  store_pii <- rbind(store_pii, data.frame(t(pii)))
  store_mu <- rbind(store_mu, data.frame(t(mu)))
  store_sigma <- rbind(store_sigma, data.frame(t(sigma)))
  store_A <- rbind(store_A, data.frame(t(c(A))))
  store_likelihood <- rbind(store_likelihood,data.frame(likelihood))
  
  i <- i+1
}

# Prediction for step N+1
## Alpha
  ### Initialization
  alpha <- data.frame(matrix(0, nrow = N, ncol = K))
  colnames(alpha) <- c("k=1", "k=2")
  alpha[1,1] <- pii[1]*dnorm(nokki$measured[1], mu[1], sigma[1])
  alpha[1,2] <- pii[2]*dnorm(nokki$measured[1], mu[2], sigma[2])
  ### Forward recursion
  for(n in 2:N){
    for(k in 1:K){
      j <- 1:K
      alpha[n,k] <- dnorm(nokki$measured[n],mu[k],sigma[k]) * sum(alpha[n-1,j]*A[j,k])
    }
  }
## Implementation of the predictive formula for x_{N+1}
sum2 <- 0
for(k in 1:K){
  sum1 <- 0
  for(j in 1:K){
    sum1 <- sum1 + A[k,j]*alpha[N,j]
  }
  sum2 <- sum2 + dnorm(nokki$measured[N],mu[k],sigma[k]) * sum1
}
prediction <- sum2/likelihood
return(prediction)
}

EM_algo(8)

## [1] 0.3435357

store_predict <- data.frame()
for(N in 10:16){
  store_predict <- rbind(store_predict,c(N,EM_algo(N)))
}
colnames(store_predict) <- c("step N", "prediction for step N+1")
store_predict

##   step N prediction for step N+1
## 1     10               0.3325959
## 2     11               0.3243835
## 3     12               0.1947939
## 4     13               0.1897608
## 5     14               0.1920680
## 6     15               0.1882866
## 7     16               0.1753298

History of the involved concepts

The problem of estimating the parameters of a multivariate data set which contains unobserved data can be managed by decomposing the original estimation problem into smaller estimation problems by factoring the likelihood of the data into a product of likelihoods ((Rubin, 1974)). We are here considering incomplete data: let us consider two sample spaces Y and X and a many-to-one mapping X -> Y. The observed data is a realization y of Y. The corresponding x in X is not observed directly, but only through y. An incomplete data in that sense is composed of both x and y, as well as a probability model describing the relationship between x and y. The problem is that the factorization of its likelihood is not always possible. Whether factorization is possible or not depends on the d-separation properties of the directed acyclic graph associated with the joint distribution of the incomplete data.

(Dempster et al., 1977) introduces a new method for computing a maximum likelihood estimates from incomplete data, that can be used when factorization of the incomplete data is not possible. The method consists in computing iteratively the maximum likelihood estimates. Each iteration of the algorithm consists of an expectation step (E-step) followed by a maximization step (M-step), which is why the method is called the EM algorithm.

(Baum and Petrie, 1966) first introduces the basic theory of hidden Markov models (HMM). This article deals with statistical inference related to finite-state Markov chains, leaving aside the question of the fit of a Markov chain for a given data set. Typically the HMM does not allow the factorization of the likelihood of the incomplete data. (Rabiner, 1989) introduces hidden Markov models for speech recognition tasks. The article covers the basics of HMMs, their application to speech recognition as a sequence labeling problem, and the modeling of temporal dynamics and variability of speech. It also addresses the advantages and disadvantages of using HMMs for speech recognition. (GHAHRAMANI, 2011) shows possible extensions of the use of HMMs considering multiple hidden state variables, multiscale representations, and mixed discrete and continuous variables.

The EM algorithm is typically used to fit mixture models. Baum-Welch (Baum et al., 1970) shows that using the forward-backward algorithm the EM algorithm can also be used to fit MMH, thus avoiding the difficulty around factorization of the likelihood. The adapted form of the EM algorithm to fit HMM is called therefore the Baum-Welch algorithm.

Baum, L.E., Petrie, T., 1966. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics 37, 1554–1563.
Baum, L.E., Petrie, T., Soules, G., Weiss, N., 1970. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics 41, 164–171.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39, 1–38.
GHAHRAMANI, Z., 2011. AN INTRODUCTION TO HIDDEN MARKOV MODELS AND BAYESIAN NETWORKS. International Journal of Pattern Recognition and Artificial Intelligence. https://doi.org/10.1142/S0218001401000836
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286. https://doi.org/10.1109/5.18626
Rubin, D.B., 1974. Characterizing the Estimation of Parameters in Incomplete-Data Problems. Journal of the American Statistical Association 69, 467–474. https://doi.org/10.2307/2285680